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Abstract 

We developed a novel parallel algorithm for particle filtering (and learning) which is 
specifically designed for GPUs (graphics processing units) or similar parallel computing 
devices. In our new algorithm, a full cycle of particle filtering (computing the value of 
the likelihood for each particle, constructing the cumulative distribution function (CDF) 
for resampling, resampling the particles with the CDF, and propagating new particles for 
the next cycle) can be executed in a massively parallel manner. One of the advantages of 
our algorithm is that every single numerical computation or memory access related to the 
particle filtering is executed solely inside the GPU, and no data transfer between the CPU's 
device memory and the CPU's host memory occurs unless it is under the absolute necessity 
of moving generated particles into the host memory for further data processing, so that it 
can circumvent the limited memory bandwidth between the GPU and the CPU. To demon- 
strate the advantage of our parallel algorithm, we conducted a Monte Carlo experiment in 
which we applied the parallel algorithm as well as conventional sequential algorithms for 
estimation of a simple state space model via particle learning, and compared them in terms 
of execution time. The results showed that the parallel algorithm was far superior to the 
sequential algorithm. 
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1 Introduction 



The state space model (SSM) has been one of the indispensable tools for time series analysis 
and optimal control for decades. Although the archetypal SSM is linear and Gaussian, the liter- 
ature of more general non-linear and non-Gaussian SSMs have been rapidly growing in the last 
two decades. For lack of an analytically tractable way to estimate the general SSM, numerous 
approximation methods have been proposed. Among them, arguably the most widely applied 
method is particle filtering (Gordon et al. (1993), Kitagawa (1996)). Particle filtering is a type 
of sequential Monte Carlo method in which the integrals we need to evaluate for filtering are 
approximated by the Monte Carlo integration. To improve numerical accuracy and stability 
of the particle filtering algorithm, various extensions such as the auxiliary particle filter (Pitt 
and Shephard (1999)) have been proposed, and still actively studied by many researchers. For 
SSMs with unknown parameters, Kitagawa (1998) proposed a self-organizing state space mod- 
eling approach in which the unknown parameters are regarded as a subset of the state variables 
and the joint posterior distribution of the parameters and the state variables is evaluated with 
a particle filtering algorithm. Other particle filtering methods that can simultaneously estimate 
parameters have been proposed by Liu and West (2001), Storvik (2002), Fearnhead (2002), Pol- 
son et al. (2008), Johannes and Poison (2008), Johannes et al. (2008), Carvalho et al. (2010), 
just to name a few. These particle filtering methods that estimate state variables and parameters 
simultaneously are often called particle learning methods in the literature. Although the ef- 
fectiveness of particle filtering methods have been proven through many different applications 
(see Zou and Chakrabarty (2007), Mihaylova et at. (2007), Chai and Yang (2007), Montemerlo 
et al. (2003), Dukic et al. (2009), and Lopes and Tsay (201 1) among others), it is offset by the 
fact that it is a time-consuming technique. Some practitioners still shy away from using it in 
their applications because of this. 

This attitude toward particle filtering would be changed by the latest technology: paral- 
lel computing. As we will discuss in Section 2, some parts of the particle filtering procedure 
are ready to be executed simultaneously on many processors in a parallel computing environ- 
ment. In light of inexpensive parallel processing devices such as GPGPuil] (general purpose 

'a high-performance GPU (graphics processing unit) was originally developed for displaying high-resolution 
2D/3D graphics necessary in video games and computer-aided design. Because a GPU is designed with a massive 



2 



graphics processing units) available to the general public, more and more researchers start to 
jump on the bandwagon of parallel computing. Lee et al. (2010) reviewed general attempts 
at parallelization of Bayesian estimation methods. Durham and Geweke (2011) developed a 
sequential Monte Carlo method designed for GPU computing and applied it to complex non- 
linear dynamic models, which are numerically intractable even for the Markov chain Monte 
Carlo method. As for parallelization of particle filtering, a few researches (see Montemayor 
et al. (2004), Maskell et al. (2006), Hendeby et al. (2007) for example) have been reported, 
though the field is still in a very early stage. To the best of the authors' knowledge, Hendeby et 
al. (2010) is the only successful implementation of the particle filtering algorithm on the GPU. 
Their implementation, however, depends on device- specific functionalities and its resampling 
algorithm is not an exact one. 

In developing parallel algorithms designed for the GPU, there are a few bottlenecks one 
should avoid. First, processing sequential algorithms on the GPU can be inefficient because of 
the CPU's device memory architecture. Roughly speaking, a GPU has two types of memory: 
memory assigned to each core and memory shared by all cores. Access to the core-linked mem- 
ory is fast while access to the shared memory takes more time. Ideally, one should try as much 
to keep all calculations on each core without any communications among cores. The second 
bottleneck is that it is time-consuming to transfer memory between the host memory, which the 
CPU uses, and the device memory, which the GPU uses. In other words the bandwidth between 
the CPU's device memory and the CPU's host memory is very narrow. Thus an ideal parallel 
algorithm for the GPU would be to calculate everything within the GPU, preferably within each 
core (without inter-core communications). 

With these bottlenecks in mind, we have developed a new parallel algorithm that computes 

the full cycle of the particle filtering algorithm in a massively and fully parallel manner, from 

computing the likelihood for each particle, constructing the cumulative distribution function 

(CDF) for resampling, resampling the particles with the CDF, and propagating new particles 

number of processor cores to conduct single-instruction multiple-data (SIMD) processing, it has been regarded 
as an attractive platform of parallel computing and researchers started to use it for high-performance computing. 
As GPU manufacturers try to take advantage of this opportunity, it has evolved into a more computation-oriented 
device called GPGPU. Nowadays almost all GPUs have more or less capabilities for parallel computing, so the 
distinction between GPUs and GPGPUs are blurred. 
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for the next cycle. By keeping all of our computations within the GPU and avoiding all memory 
transfer between the GPU and the CPU during the execution of the particle filtering algorithm. 
In this way, we exploit the great benefits of parallel computing on the GPU while avoiding its 
short comings. Additionally, since our parallel algorithm does not utilize any device- specific 
functionalities, it can be easily implemented on other parallel computing devices including 
cloud computing systems. 

In order to compare our new parallel algorithm with conventional sequential algorithms, 
we conducted a Monte Carlo experiment in which we applied the competing particle learn- 
ing algorithms to estimate a simple state space model (stochastic trend with noise model) and 
recorded the execution time of each algorithm. The results showed that our parallel algorithm 
on the GPU is far superior to the conventional sequential algorithm on the CPU by around 30x 



The organization of this paper is as follows. In Section 2, we briefly review state space 
models and particle filtering and learning. In Section 3, we describe how to implement a fully 
parallelized particle filtering algorithm, in particular how to parallelize the CDF construction 
step and the resampling step. In Section 4, we report the results of our Monte Carlo experiment 
and discuss their implications. In Section 5, we state our conclusion. 



where p{yt\xt) stands for the conditional distribution or density of observation yt given unob- 
servable Xt and p{xt\xt-i) stands for the conditional distribution or density of Xt given Xt-i, 
which is the previous realization of Xt itself. In the literature of SSM, unobservable Xt, which 
dictates the stochastic process of yt, is called the state variable. 

Time series data analysis with the SSM is centered on how to dig up hidden structures of the 
state variable out of the observations {yt}f=i- In particular, the key questions in applications 
of SSM are (i) how to estimate the current unobservable xt, (ii) how to predict the future 



to 200x. 



2 State Space Models and Particle Filtering 



A general form of SSM is given by 




(1) 
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state variables, and (iii) how to infer about the past state variables with the data currently 
available. These aspects of state space modeling are called filtering, prediction, and smoothing, 
respectively. 

The filtering procedure, which is the main concern in our study, is given by the sequential 
Bayes filter: 



p{yt\xt)p{xt\yi:t-i) 
J p{yt\xt)p{xt\yi:t-i)dxt 



where yi:t — {yi, . . . , yt} = 1, . . . , T") and p{xt\yi:t) is the conditional density of the state 
variable Xt given yi-t. In essence, equation (3) is the well-known Bayes rule to update the 
conditional density of while equation (2) is the one-period-ahead predictive density of Xt 
given the past observations yi-.t-i- By applying (2) and (3) repeatedly, one keeps the conditional 
density p{xt\yi:t) updated as a new observation comes in. 

In general, a closed-form of neither (2) nor (3) is available, except for the linear Gaussian 
case where we can use the Kalman filter (Kalman (I960)). See West and Harrison (1997) on 
detailed accounts about the linear Gaussian SSM. To deal with this difficulty, we apply particle 
filtering, in which we approximate the integrals in (2) and (3) with particles, a random sample 
of the state variables generated from either the conditional density p{xt\yi:t) or the predictive 
density p{xt\yi:t-i)- Let {xl^^}^i denote N particles generated from p{xt\yi:t-i)- We can 
approximate p(xt I by 

N 

N 



p{xt\yi:t-i) -^^H^t- xi'^) (4) 

i=l 



where 6{-) is the Dirac delta. Then the filtering equation (3) is approximated by 

1 ^ 

1=1 

J P{yt\xt)j^ ^ 5{xt - xf)dxt 



N 

P{xt\yi.,t) ^ '-^ = wi"'^6{xt - x["^), (5) 



wl' 



p{yt\x. 



1=1 



Y.i=iP{yt\xt) 
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(5) implies that the conditional density p(xt | yi:t) is discretized on particles {x^^ }fLi with proba- 
bilities {w^^^jfL^. Therefore we can obtain a sample of Xt, {x[^^}fLi, fromp{xt\yi:t) by drawing 
each x^^ out of {x['^ }fL-^^ with probabilities {wf^ when the approximation (5) is sufficiently 
accurate. This procedure is called resampling. In reverse, if we have N particles {x^li}f=i gen- 
erated from p{xt-i\yi:t-i), we can approximate by 

pixt\yi:t-i) ^ / p{xt\xt-i)—'^5{xt-i - xfl^)dxt-i = — ^p{xt\x[-i)- (6) 

i=l i=l 

Then (6) implies that we can obtain a sample of Xt+i, from p(xt+i\yi;t) by gen- 

erating each Xj^]^ from p{xt+i\x[^^), which is called propagation. Hence we can mimic the 
sequential Bayes filter by repeating the propagation equation (6) and the resampling equation 
(5) for i = 1, 2, 3, . . . This is the basic principle of particle filtering. The formal representation 
of the particle filtering algorithm is given as follows. 

Algorithm: particle filtering 
Step 0: Set the starting values of N particles {xq^}^i. 
Step 1: Propagate from p(xt\Xf^li) , (i — 1,. . ., N). 
Step 2: Compute weight oc ^). 
Step 3: Resample {xl'^jfL-^ from {xf^jfL-^ with weight wf-*. 
When a state space model depends on unknown parameters 9, 

yt^p(yt\xt,0) 

(7) 

Xt ^ p{xt\xt-u9) 

we need to evaluate the posterior distribution p{9\yi;t) given the observations yi:t. In the 
framework of particle filtering, p{9\yi;t) is sequentially updated as a new observation arrives, 
which is called particle learning. The particle learning algorithm is defined as follows. Let 
{4'^ = {xi^\9t^)}^i and {z^^ — {xt\9l^^)}f=i denote particles jointly generated from 
p{xt, ^|yi:t-i) and p{xt, ^|yi:t) respectively. Then the particle approximation of the Bayesian 
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learning process (Kitagawa (1998)) is given by 

1 ^ 

p{zt\yi..t-i) ^ J2p(^t\^t-i)^ 

i=l 

Piz,\y,,) f2 ^f'^i^t - z?l w? = Jj^'^'^f^^ . (9) 
This is a rather straightforward generalization of the particle filtering. 

Algorithm: particle learning 

Step 0: Set the starting values of N particles {zo'^}^^ 

Step 1: Propagate from p{zt\zl!_-^) , (i = 1, . . . , A^). 

Step 2: Compute weight wf^ oc p{yt\zf^)- 

Step 3: Resample from {z'f^}f^^ with weight \ 

Once we generate {fi'f ^ } by the particle learning, we can treat them as a Monte Carlo sample 
of 9 drawn from the posterior density p{9\yi;t). Thus we calculate the posterior statistics on 9 
with {ef} in the same manner as the traditional Monte Carlo method or the state-of-the-art 
Markov chain Monte Carlo (MCMC) method. 

The computational burden of particle filtering will be prohibitively taxing as the number of 
particles N increases. The number of the likelihood p{yt\xt^) to be evaluated, the number of 
executions for constructing the CDF of particles, and the number of particles to be generated 
in propagation will increase in 0{N). The number of executions for resampling with the CDF 
will increase in O(iV^) when we use a naive resampling algorithm, but it can be reduced to 
0{N log N) with more efficient algorithms, which we will discuss in the next section. Thus 
sequential particle-by-particle execution of each step in the particle filtering (and learning) al- 
gorithm is inefficient when N is large, even though the particle filtering method by construction 
requires a large number of particles to guarantee precision in the estimation. 

To reduce the time for computation, we propose to parallelize all steps in particle filtering 
so that we can execute the parallelized particle filtering algorithm completely inside the GPU. 
The key to constructing an efficient parallel algorithm is asynchronous out-of-order execution 
of jobs assigned to each processor. We need to keep a massive number of processors in the GPU 
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as busy as possible to fully exploit a potential computational power of the GPU. Therefore each 
processor should waste no milliseconds by waiting for other processors until they complete 
their jobs. If the order of execution is independent of the end results, asynchronous out-of- 
order execution is readily implemented. In this situation, parallelization is rather straightfor- 
ward. In the particle filtering method, this is the case for computation of the likelihood and the 
propagation step and these steps can be computed in parallel without any modifications. For 
constructing the CDF and resampling particles, on the other hand, the conventional algorithm 
does not allow asynchronous out-of-order execution and thus parallelization of these steps can 
be tricky. In order to devise a fully parallelized particle filtering method, we need to develop 
new algorithms for parallelization for these steps. In the next section, we describe how to 
implement the CDF construction and resampling in a parallel computing environment. 

3 Full Parallelization of Particle Filtering 
3.1 Fully parallelized CDF construction 

The goal of resampling is to generate random integers, which are the indices of particles, 
from a discrete distribution on {1, ... , A^} with the cumulative distribution function. 



Therefore we need to construct the CDF (10) before we perform resampling. 

Hendeby et al. (2007) developed an algorithm designed for parallel execution of the CDF 
construction. The process consists of two procedures: the forward adder and the backward 
adder. These are illustrated in Table [TJ Suppose that we have four particles and their weights 
are given by {2, 4, 3, 1}. What we want to compute is the cumulative sum, {2, 6, 9, 10}. First 
we apply the forward adder as described in Panel (a) of Table[I] In the initial step of the forward 
adder, each weight is assigned to a node (nodes are corresponding to threads in the GPU). Let 
wf^ denote the sum of weights of Node i in Step j. Thus the initial states of the nodes are 
w^^^ = 2, w^^ = 4, w[^^ = 3, and w[^'' = 1. Then in each step of the forward adder, a 
neighboring pair of nodes is combined to form a new node which inherits the sum of weights 
from the two combining nodes as its new weight. For example, in Step 2 of Table [Ha), a pair 




(10) 
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of nodes, w^^^ = 2 and w^"^ = 4, is combined into a new node ^2^^ = 2 + 4 = 6. We repeat 
this procedure until all nodes are collapsed into a single node whose weight is the sum of all 
weights. 

Then we apply the backward adder as described in Panel (b) of Table [TJ Note that the 
numbering of steps is reversed in Table \T\ (b) as this is intentional. In the initial step of the 

(i) 

backward adder, we start at the last node in the forward adder. Let denote the sum of weights 
up to Particle i in Step j. In Panel (b) of Table [H we already have s^^'^ = 2 + 4 + 3 + 1 = 10 
by the forward adder. As it moves backward, the current node branches into two nodes in each 
step. The new right node inherits the same value of the current node, while the partial sum of 
the new left node equals the value of the current node minus the weight of the right node in the 
corresponding step of the forward adder. For example, in Step 1 of Table [Jb), the node with 

(2) (1) (2) (2) (2) (2) 

^2 = 6 branches to s\ = S2 — wi = Q — A = 2 and s\ = S2 = Q- Note that the number 
of particles should be an integer power of 2 in order to apply this algorithm. 

Table 1 : Parallel computation of the cumulative sum 



(a) Forward adder 



Step 1 


) = 2 


) = 4 


(3) Q 


wi'^ = 1 


Step 2 


= 2 + 4 = 6 


w^2^ =3 + 1 = 4 


Step 3 


= 6 + 4 = 10 


(b) Backward adder 


Step 3 


4^^ = 10 


Step 2 


sf^ = 10 - 4 = 6 


4^^ = 10 


Step 1 


sji) =6-4 = 2 


4'^ = 6 


^ = 10 - 1 = 9 


4'^ = 10 



3.2 A Review of the Conventional Resampling Algorithms 

Before we proceed to describe our parallel resampling algorithm, we briefly review several 
conventional sequential resampling algorithms. In the most naive manner, the resampling algo- 
rithm is expressed as 

for(i in 1:N) { 

9 



u <- runif ( 1 ) ; 
for ( j in 1 : N) { 
if (u < q[ j] ) 
break; 

} 

sampled [i] <- j; 

} 

Although this is the most basic and exact resampling procedure, it is extremely time-consuming 
0{N'^) operation, as we need to search out each particle one by one in the whole CDF. One 
simple way to improve it is by sorting the uniform variates before the search process starts: 

u <- sort (runif (N) ) ; 
for (i in 1 :N) { 

for ( j in i : N) { 

if(u[j] < q[j]) 
break; 

} 

sampled [i] <- j; 

} 

Compared to the naive algorithm, this algorithm starts off by sampling all necessary random 
numbers from the uniform distribution, sorts them in ascending order, and then sequentially 
resamples particle with them. By sorting the uniform variates, each search for a particle can 
be started where the last research was left off. The offset of this algorithm is that, although 
the resampling procedure is a more efficient 0{N log N) operation, sorting uniform variates 
can be computationally strenuous as the number of particles increase, depending on the sorting 
algorithm we use. 

Alternative resampling algorithms have been introduced in order to circumvent the compu- 
tational strains found in exact resampling algorithms like those above. The stratified resampling 
algorithm, 

for(i in 1:N) { 
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u <- (i-l+runif (1) ) /N; } 
for ( j in { i } : N) { 
if (u < q[ j] ) 
break; 

} 

sampled [i] <- j; 

} 

conducts the resampling procedure by generating uniform variates on N equally spaced inter- 
vals [{i — 1)/N,i/N] {i — 1,. . . ,N). Since only one particle will always be picked for each 
interval, it does not exactly generate random integers from the the CDF {q{l), . . . ,q{N)}. The 
systematic resampling, 

uO <- runif ( 1 ) ; 
for (i in 1 :N) { 

u <- (i-l+uO) /N: 
for ( j in i : N) { 
if(u < q[j]) 
break; 

} 

sampled [i] <- j; 

} 

is similar to the stratified resampling but it always chooses an identical point in[{i — l)/N,i/N] 
for all i. In essence, the two alternative resampling algorithms are similar to the aforementioned 
resampling with sorted uniform variates, but without the time-consuming sorting procedure. 
However, neither stratified sampling nor systematic sampling is exact. Therefore, they may 
produce less accurate results compared to exact resampling algorithms. In particular when es- 
timating unknown parameters in particle learning, they may be more problematic because they 
cannot jointly resample the state variables and the parameters. Thus, despite their computa- 
tional superiority in a sequential framework, we need to be careful of using them in practice. 
Additionally, as these methods benefit only from the sequential framework, it is not obvious 
how to parallelize them. 
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3.3 Fully parallelized resampling 



To parallelize the resampling 



procedure while maintaining its exactness, we have developed a 



parallel resampling algorithrro based on the cut-point method by Chen and Asau (1974). 

A cut-point, Ij, for given j = 1, . . . , is the smallest index i such that the corresponding 
probability q{Ij) should be greater than (j — 1)/A^. In other words, 

j - I 

q{Ij) = udn^q{i) subject to q{i) > — (j = 1, • • • , A^)- (H) 

Given cut-points {/i, . . . , Jat}, random integers between 1 and A^ is generated from the CDF 
{g(l), . . . , q{N)} by the following procedure: 



Algorithm: cut-point method 

Step 0: Let j = 1 

Step 1: Generate u from the uniform distribution on the interval (0, 1). 

Step 2: Let k = I\nu] where \x~\ stands for the smallest integer greater than or equal to x. 

Step 3: If M > q{k), let /c /c + 1 and repeat Step 3; otherwise, go to Step 4. 

Step 4: Store k as the index of the particle. 

Step 5: If j < N, let j ^ j + 1 and go back to Step 1; otherwise, exit the loop. 

Once all cut-points {/i, . . . , 1^} are given, parallel execution of the cut-point method is 
straightforward because the execution of Step 1 - Step 3 does not depend on the index j. The 
fully parallel resampling algorithm distributed on A^ threads is given as follows. 



Algorithm: parallelized cut-point method 

Step 0: Initiate the j-th thread. 

^Hendeby et al. (2007) developed a parallel resampling algorithm for particle filtering which is specifically 
designed for CPUs. Their method, however, is dependent on a device specific functionahty (rasterization) and its 
efficiency and scalability is hmited by the GPU architecture. Our parallel algorithm, on the other hand, is more 
versatile and scalable because it requires only basic thread coordination mechanisms such as shared memory and 
thread synchronization which are provided by most parallel computing systems. 
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Step 1: Generate u from the uniform distribution on the interval (0, 1). 

Step 2: Let k = I\nu'\ where \x\ stands for the smallest integer greater than or equal to x. 

Step 3: If M > q{k), let /c ^ /c + 1 and repeat Step 3; otherwise, go to Step 4. 

Step 4: Store k as the index of the particle. 

Step 5: Wait until all threads complete the job. Otherwise, exit the loop. 

However, the conventional algorithm for computation of the cut-points (see Fishman (1996, 
p. 158) for example) is not friendly to parallel execution. Thus, we have developed an efficient 
algorithm for parallel search of all cut-points. To devise such a search algorithm, let us intro- 
duce 

L, = \Nq{j)\ (j = l,...,iV). 

and Lo = for convention. Due to the monotonicity of the cumulative distribution function, 
we observe 

1. = Lo < Li < • • • < Ljv = A^. 

2. If Lj_i < Lj, a cut-point such that 

= ^min^g(i) subject to Nq{i) > k — 1, {k = Lj_i + 1, . . . , Lj) 

is given as 4 = j. 

3. If Lj_i — Lj, j is not corresponding to any cut-points. 

4. Li — Ii always holds. 

The above properties give us a convenient criterion to check whether a particular Lj is a cut- 
point or not and it leads to the following multi-thread parallel algorithm to find all cut-points. 

Algorithm: parallelized cut-point search 
Step 0: Initiate the j-th thread. 
Step 1: Compute Lj = \Nq(j)]. 
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step 2: Let k = Lj. 

Step 3: If A; > let Jfc = j and end the thread. 

Step 4: Let k ^ k — 1 and go to Step 3. 

With a fully parallel resampling algorithm, particle filtering can be executed in a fully par- 
allel manner without any compromise. Additionally, as the particle filtering algorithm (and the 
particle learning algorithm) is conducted completely on the GPU and each particle goes through 
the algorithm on each designated core without syncing, the advantage of parallel computing is 
gained to the fullest while its shortcoming is kept to its minimum (data transfer between the 
GPU's device memory and the CPU's host memory only occurs at the beginning and the end 
of the computation). 

4 Numerical Experiment 

In our experiment, we use a stochastic trend with noise model: 



as the benchmark model for performance comparison. In (fT2l) . we set xq = 0, cr^ = 1, = 0.1, 
and generate {yi, . . . , ?/ioo}- Then we treat and as unknown parameters and apply the 
particle learning algorithm by Carvalho et al. (2010) to (fT2l) . The prior distributions are 



To demonstrate the effectiveness of our new parallel algorithm, we will compare the fol- 
lowing types of algorithms: 

• Sequential algorithm on the CPU 

CPU(n): Naive resampling with single precision 

CPU(s): Resampling with sorted uniform variates with single precision 

• Parallel algorithm on the GPU 




(12) 



~7V(0, 10), (T^ ~ 
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GPU(sp): Parallel resampling by the cut-point method with single precision 
GPU(dp): Parallel resampling by the cut-point method with double precision 

The first two are conventional sequential algorithms for resampling. The code for the parallel 
algorithm is written in CUDA while that for the sequential algorithms is in C. Both are com- 
piled and executed on the same Linux PC with specifications shown in Table [21 Alternative 
resampling algorithms, such as the stratified and systematic resampling, are not considered 
here, as they are not exact resampling. However, if we exclude the time consumed by the sort- 
ing procedure from the resampling time of CPU(s), we get a very good estimate of how long 
they might take. 

Table 2: Hardware specifications 

CPU (Intel Core-i7 2700K) GPU (NVIDIA GTX580) 

3.50GHz 772MHz 

4 512 

8GB 3GB 



Core clock rate 
Number of cores 
Memory 



For each algorithm, we execute the particle learning ten times with the same generated path, 
{Ui, . . . , yioo}, and recorded the execution time of each trial. To avoid the influence of possible 
outliers, we took the average of the middle five of them. The results are listed in Table [3] and 
the plots of the total execution time against the number of particles are shown in Figure [TJ 

The results clearly show that our new parallel algorithm, which runs completely in parallel 
and keeps all executions within the GPU, can be extremely effective compared to conventional 
sequential algorithms. As the number of particles increases (and the precision of the estimate 
increases), GPU(sp) outperforms that of CPU(n) by more than 200x in the case of 1,048,576(= 
2^°) particles and l,671x in the case of 8,388, 608(= 2^^) particles. Even when compared with 
the exact resampling with sorted uniform variates {CPU(s)), GPU(sp) is consistently faster by 
more than 20x when the number of particles is 131,072 or more. In the comparison between 
GPU(sp) and GPU(dp), the difference is somewhere around two times, which is consistent 
with intuition. Interestingly, the computation on the GPU in double precision is still a good 
5-lOx faster than that of the CPU in single precision, which demonstrates the sheer power of 
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Table 3: Comparison in execution time 





Number of particles 


Time 


1,024 


4,096 


16,384 


131,072 


1,048,576 


8,388,608 


(i) CPU(n) 


30 


80 


390 


7,050 


307,600 


19,855,870 


(ii) CPU(s) 


48 


188 


754 


6,088 


49,266 


330,742 


(iii) GPU(sp) 


17 


21 


44 


215 


1,526 


11,881 


(iv) GPU(dp) 


20 


25 


129 


436 






Ratio 


1,024 


4,096 


16,384 


131,072 


1,048,576 


8,388,608 


(i)/(iii) 


1.7 


3.8 


7.7 


32.7 


201.5 


1671.3 


(ii)/(iii) 


2.8 


9.0 


17.3 


28.3 


32.3 


27.8 


(ii)/(iv) 


2.4 


7.4 


5.8 


14.0 






(iv)/(iii) 


1.1 


1.2 


3.0 


2.0 







Note: the values of execution time are in milliseconds. 



■ - ^ -CPU(n) 
□ CPU(s) 
7 ^:^GPU(sp) 
" r | -A- GPU(dp) 




lO'' lO" 10^ 10^ 10^ 



Number of particles 

Figure 1 : Plots of execution time against the number of particles 
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parallel processing on the GPU. Due to memory failure, GPU (dp) failed for particles more than 
a million, though this could be remedied by upgrading the GPU to the one with more memory 
or using multiple GPUs. 

To see which part of the particle learning contributes to the reduction in execution time, we 
divide the cycle of particle learning into the following steps; 

Initialize: set the starting values of the particles; 

CDF: compute the likelihood and construct the CDF; 

Resample: resample the particles with the CDF; 

Propagate: propagate a new set of particles; 

Store: store the generated particles into the CPU's host memory (GPU only); 
Other: keep the results and proceed with the particle learning; 

The results in the case of 131,972 particles are listed in Table HI The tendency we observe in 
Table mis similar in the other cases. 

Breaking down the execution time gives us deep insights into how the GPU architecture 
works and its strong and weak points. Examining the results in CPU(s), we first notice that the 
CDF step and Propagation step put together occupy the bulk of the total execution time, while 
the Resampling step only accounts for less than ten percent of the total execution time and much 
of it coming from the sorting step. Looking closely into the gains by parallelization in each step, 
the largest comes from the CDF step with a gain of 248x, followed by the Propagation step 
with a gain of 45. 3x, then followed by the Resampling step with a gain of 1 1.9x. Although the 
gain in the Resampling step has less of an impact compared with the overwhelming gain in CDF 
and Propagation, it does not overshadow the fact that it gained 2.7x in single precision even if 
we ignore the time spent in sorting the uniform random variates and focus on the resampling 
procedure only. That implies that our parallel resampling on the GPU can beat the stratified 
resampling on the CPU since the stratified resampling is roughly equivalent to the resampling 
with sorted uniform variates without sorting in terms of computational complexity. As for 
the Other step, CPU(s) and GPU(sp) is identical. This is because for both algorithms, all 
executions of this step are conducted only on the CPU. Thus, we observe no difference. Finally, 
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Table 4: Breakdown of execution time 







Cycle of particle learning 




Time 


Initialize 


CDF 


Resample 


Propagate 


Store 


Other 


Total 


(i) CPU(s) 


26 


2652 


454 

(102) 


2900 




56 


6088 


(ii) GPU(sp) 


0.72 


11 


38 


64 


46 


56 


215 


(iii) GPU(dp) 


1.51 


19 


61 


174 


89 


92 


436 


Ratio 


Initialize 


CDF 


Resample 


Propagate 


Store 


Other 


Total 


(i)/(ii) 


35.9 


248.0 


11.9 
(2.7) 


45.3 




1.0 


28.3 


(iii)/(ii) 


2.1 


1.8 


1.6 


2.7 


1.9 


1.7 


2.0 



Notes: (a) the number of particle is 131,072; 

(b) the values of execution time are in milliseconds; 

(c) the number in parentheses is corresponding to the time excluding 
the sorting step. 
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we observe a good amount of reduction in initiating the particle learning algorithm by our 
parallel algorithm; however, the time spent in initiation is quite trivial, in particular when the 
number of the sample period T is large (T = 100 in our experiment). 

Although it is clear that our parallel algorithm is superior to the conventional sequential 
algorithm through every step. Table |4] indicates that there is one drawback of using the GPU. 
That is memory transfer. The Store step measures the time it takes to transfer the generated 
particles from the GPU's device memory to the CPU's host memory. Table |4] shows that it 
takes up roughly 15-20% of the execution time. Note that, for fairness of the experiment, the 
GPU returns all of the particles it generated to the CPU's host memory. If we were to return 
only the mean, the variance, and other statistics of the state variables and parameters, the time 
for the Store step can be cut down significantly. 

5 Conclusion 

In this study, we have developed a new algorithm to perform particle filtering and learning in 
a parallel computing environment, in particular on GPGPUs. Our new algorithm has several 
advantages. First, it enables us to keep all executions of the particle filtering (and learning) al- 
gorithm within the GPU so that data transfer between the GPU's device memory and the CPU's 
host memory is minimized. Second, unlike the stratified sampling or the systematic sampling, 
our parallel sampling algorithm based on the cut-point method can resample particles exactly 
from their CDF. Lastly, since our algorithm does not utilize any device specific functionalities, 
it is straightforward to apply our algorithm to a multiple GPU system or a large grid computing 
system. 

Then we conducted a Monte Carlo experiment in order to compare our parallel algorithm 
with conventional sequential algorithms. In the experiment, our algorithm implemented on the 
GPU yields results far better than the conventional sequential algorithms on the CPU. Although 
we keep the SSM as simple as possible in the experiment, our parallel algorithm can also be 
applied to more complex models without any fundamental modifications to the programming 
code and this little investment will return a significant gain in execution time instantaneously. 

Our fully parallelized particle filtering algorithm is beneficial for various applications that 
require estimating powerful but complex models in a shorter span of time; ranging from motion 
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tracking technology to high-frequency trading. We even envision that one can perform real-time 
filtering of the state variables and the unknown parameters in a high-dimensional nonlinear non- 
Gaussian SSM on an affordable parallel computing system in a completely parallel manner. 
That would pave the way for a new era of computationally intensive data analysis. 
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